to complex sounds that varied in amplitude-spectral shape. The subjective
feature representation obtained from the ALSCAL nonmetric scaling program
was generally consistent with the theoretical feature representation produced
by the optimal structure-preserving transformation applied to the
loudness-weighted spectra. The two comparison features as well as the relative
importance of the two dimensions were successfully predicted by the model.
Practical implications for the subjective evaluation of complex signals are
discussed and refinements to the transformations in the model are suggested for
further research.
A variety of recent auditory psychophysical studies have
required listeners to evaluate the subjective similarity of two
or more complex acoustic stimuli. Such studies have involved
both speech (Shepard, 1972) and complex nonspeech sounds (Miller
& Carterette, 1975; Howard, 1977; Grey & Gordon, 1978). It is
generally assumed that the similarity ratings obtained in this
situation reflect the outcome of a perceptual comparison based on
one  or  more  psychophysical  features  that  characterize  members  of 
the  stimulus  set.  Typically,  standard  metric  and  nonmetric 
multidimensional  scaling  techniques  are  used  to  extract  a set  of 
perceptual  dimensions  from  the  observed  matrix  of  similarity 
judgments.  The  dimensions  revealed  in  this  analysis  are  thought 
to  reflect  the  elementary  perceptual  units  or  features  that  the 
listeners  used  to  compare  the  stimuli.  An  important  implicit 
assumption  in  this  research  is  that  human  listeners  can  reliably 
report  the  perceived  similarity  among  sounds  even  though  they 
may  not  be  explicitly  aware  of  the  underlying  stimulus  features. 
As  Plomp  and  his  associates  have  indicated  (Plomp,  1976),  these 
methods  have  contributed  significantly  to  our  understanding  of 
the  processes  involved  in  timbre  perception. 

The Feature Selection Problem

The  specific  question  addressed  in  the  present  paper 
concerns  the  perceptual  features  listeners  use  to  make  pairwise 
similarity  judgments  on  a set  of  sixteen  complex  steady-state 
sounds  that  differ  primarily  in  timbre.  How  are  the  elementary 
units  of  comparison  determined?  What  criteria  do  listeners  use 
to select a subset of all possible dimensions for comparing the
individual members of the stimulus set?

Howard and Ballas (1978) have referred to this as the
feature selection problem. Two contrasting approaches to this
problem have been suggested in the literature. First, the human
auditory system may be equipped with a set of specific
feature-detecting mechanisms that monitor incoming aural information for
particular stimulus cues (e.g., Barlow, 1972). This approach
emphasizes the importance of the feature detectors themselves.
Each detector "looks for" an individual stimulus property, and a
set of feature detectors determines a property list for the
stimulus. Howard and Ballas (1978) referred to this as the
property-list approach. Second, it is possible that the
auditory system has an internalized set of rules and criteria
for feature selection rather than a set of finely tuned feature
detectors. These rules and processes enable the listener to
determine what the comparison features should be in any
particular stimulus context. This view was called the
process-oriented approach (Howard & Ballas, 1978).

Although evidence supporting both positions can be found in
the literature, Howard and Ballas (1978) argue that the
process-oriented approach is more naturally suited for
theorizing about the timbre comparison task. While it may be
reasonable to argue that man has evolved specialized brain
"filters" for certain aural cues (e.g., speech features), an
extension of this argument to include detectors for the
individual timbre attributes of complex tones strains
credibility. Since timbre obviously encompasses a large set of
perceptual attributes (Plomp, 1976; von Bismarck, 1974), a
property-list approach to timbre comparison would, at the
very least, suffer an embarrassing lack of parsimony. Furthermore, if we
were to argue that only a subset of detectors would be used in
any particular comparison task, then we would still be obliged to
explain how that subset is selected by the listener.
Consequently,  in  the  present  paper  we  adopt  a process-oriented 
approach  to  the  feature  selection  problem.  In  other  words, 
rather  than  searching  for  the  set  of  invariant  auditory  feature 
detectors  that  underlie  timbre  comparison,  we  will  attempt  to 
outline  some  general  principles  that  would  account  for  feature 
selection  in  a variety  of  comparison  contexts. 

Toward a Model of Feature Selection

In their recent treatment of this problem, Howard and
Ballas (1978) argue that when asked to compare the timbre of
steady-state sounds, human listeners perform a structural
analysis on the low-resolution spectra of the comparison
stimuli. In this case, the feature selection process may be
thought of as a structure-preserving transformation that maps
stimuli  from  an  initial  low-level  representation  (the 
measurement  representation)  onto  a higher-order  representation 
of  lower  dimensionality  (a  feature  representation).  In  the  case 
of  steady-state  complex  sounds  it  may  be  argued  that  the 
measurement  representation  is  approximated  by  a 1/3-octave 
spectral  analysis,  adjusted  for  unequal  sensitivity  across  the 
spectrum  (Zwicker,  Flottorp  & Stevens,  1957).  Although  in 
general it is evident that information will be lost with such a
transformation, Howard and Ballas (1978) argue that listeners
select  the  comparison  features  so  as  to  minimize  this  loss.  In 
other  words,  in  a comparison  task,  features  are  selected  to 
account  for  as  much  of  the  variability  among  the  measurement 
representations  of  the  stimuli  as  possible.  They  point  out  that 
a transformation  having  these  properties  is  very  similar  to  a 
principal  components  analysis. 

A principal  components  analysis  provides  a transformation 
that  maps  objects  from  one  space  into  a subspace  of  lower 
dimensionality.  The  first  principal  component  is  simply  a new 
axis  in  the  original  space  that  accounts  for  most  of  the 
variability  among  the  objects.  In  other  words,  the  set  of 
projections  of  objects  in  the  measurement  space  onto  the  first 
principal  component  has  maximum  variance.  The  second  principal 
component  is  an  axis  orthogonal  to  the  first  that  accounts  for 
most of the residual variance, and so on (Harris, 1975).
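The maximum-variance property described above can be checked numerically. The following is a minimal sketch, assuming NumPy and a purely hypothetical data set (200 objects measured on 5 correlated variables); the function name `first_principal_component` is introduced here for illustration only and is not part of the original analysis.

```python
import numpy as np

def first_principal_component(X):
    """Return the unit-length axis of maximum variance and that variance."""
    K = np.cov(X, rowvar=False)            # m-by-m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(K)   # eigenvalues in ascending order
    return eigvecs[:, -1], eigvals[-1]     # last column = largest eigenvalue

rng = np.random.default_rng(0)
# Hypothetical data: 200 objects, 5 correlated measurements each.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
pc1, var1 = first_principal_component(X)

# Compare against the variance of projections onto random unit directions.
Xc = X - X.mean(axis=0)
dirs = rng.normal(size=(1000, 5))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
random_vars = (Xc @ dirs.T).var(axis=0, ddof=1)

print("variance along first component:", var1)
print("largest variance along a random axis:", random_vars.max())
```

No randomly chosen axis should yield a larger projection variance than the first principal component, which is the sense in which that component "accounts for most of the variability."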

Given  these  arguments,  we  can  construct  a preliminary  model 
of  the  stimulus  comparison  process  for  steady-state  complex 
sounds.  Figure  1 displays  an  outline  of  our  approach. 


Insert  Figure  1 here 

An initial measurement transformation, denoted $M$, determines
a measurement representation from the time-domain stimuli. We
assume that $M$ reflects a low-resolution spectral analysis,
and denote the measurement representation for stimulus $S_i$ by a
column vector of m 1/3-octave band levels, $\mathbf{x}_i = M(S_i)$, where
$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im})$. After this, a second transformation,
$T$, occurs that extracts a set of comparison features from the
measurement vectors. In our model we assume that the outcome of
this transformation is a column vector of n feature values for
each stimulus. That is, $\mathbf{f}_i = T(\mathbf{x}_i)$, where
$\mathbf{f}_i = (f_{i1}, f_{i2}, \ldots, f_{in})$ with $n < m$. Once the feature
information is available, the listener compares the stimuli to
determine a similarity judgment, $C(\mathbf{f}_i, \mathbf{f}_j)$.

Figure 1. Preliminary three-stage model of the aural comparison process.

The heart of the feature selection problem involves
specifying the transformation T. Following Howard and Ballas
(1978), we have argued that this transformation reflects the
outcome  of  a structural  analysis  of  the  stimulus  set,  much  like 
a principal  components  analysis.  This  assumption  specifies  four 
important  properties  of  the  transformation.  First,  the 
transformation  is  linear.  Since  the  features  represent  new 
dimensions  in  the  measurement  space,  the  transformation  must 
project  each  stimulus  in  the  measurement  space  onto  the  new 
dimensions. The feature values, $\mathbf{f}_i$, for stimulus $S_i$ are
therefore weighted linear combinations of the original
measurements, $\mathbf{x}_i$. In matrix notation, each vector of feature
values is the product of a measurement vector and an n by m
matrix of weights or coefficients, $T$: $\mathbf{f}_i = T\mathbf{x}_i$, or

$$
\begin{pmatrix} f_{i1} \\ f_{i2} \\ \vdots \\ f_{in} \end{pmatrix}
=
\begin{pmatrix}
t_{11} & t_{12} & \cdots & t_{1m} \\
t_{21} & t_{22} & \cdots & t_{2m} \\
\vdots & \vdots & & \vdots \\
t_{n1} & t_{n2} & \cdots & t_{nm}
\end{pmatrix}
\begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{im} \end{pmatrix}.
$$

In this view, the jth feature coordinate for stimulus $S_i$, $f_{ij}$,
is determined by the inner product of the jth row vector of $T$
and the measurement vector for that stimulus,

$$
f_{ij} = \sum_{k=1}^{m} t_{jk}\, x_{ik}.
$$

Second, the transformation should project stimuli from the
m-dimensional measurement space onto the n-dimensional feature
space  while  preserving  as  much  of  the  original  information  as 
possible.  This  is  achieved  by  selecting  transformation 
coefficients,  T,  such  that  the  variance  of  stimulus  projections 
onto  each  dimension  is  maximal. 

Third,  the  transformation  coefficient  vector  for  each 
feature  (i.e.,  each  row  in  the  T matrix)  should  be  of  unit 
length.  This  restriction  is  required  to  avoid  trivially 
satisfying  the  second  condition  by  selecting  arbitrarily  large 
coefficients.

Fourth,  the  transformation  coefficient  vectors  should  be 
mutually  orthogonal.  Since  the  primary  function  of  the  feature 
transformation  is  to  eliminate  redundancy  in  the  measurement 
representations,  it  is  obviously  desirable  that  the  features 
carry  as  little  overlapping  information  as  possible.  Together 
with  the  third  condition,  this  specifies  that  the  vectors  of 
projection  coefficients  be  orthonormal,  i.e.,  orthogonal  and  of 
unit  length. 
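The linearity and orthonormality conditions listed above can be illustrated with a short numerical sketch. It assumes NumPy, hypothetical sizes (m = 6 band levels, n = 2 features), and an arbitrary orthonormal matrix built by QR factorization; this matrix only stands in for a transformation with the stated properties and is not the variance-maximizing one.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 2   # hypothetical sizes: 6 band levels, 2 features

# Build an arbitrary n-by-m matrix with orthonormal rows (illustration only).
Q, _ = np.linalg.qr(rng.normal(size=(m, n)))   # Q is m-by-n, orthonormal columns
T = Q.T                                        # rows of T: unit length, mutually orthogonal

x_i = rng.normal(size=m)   # hypothetical measurement vector for one stimulus
f_i = T @ x_i              # linear transformation: f_i = T x_i

# Each feature coordinate is the inner product of one row of T with x_i.
f_by_rows = np.array([sum(T[j, k] * x_i[k] for k in range(m)) for j in range(n)])

print(np.allclose(f_i, f_by_rows))        # matrix form and inner-product form agree
print(np.allclose(T @ T.T, np.eye(n)))    # rows are orthonormal
```

The check `T @ T.T == I` expresses the third and fourth conditions together: unit-length rows give ones on the diagonal, and mutual orthogonality gives zeros elsewhere.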

Transformations  having  these  properties  are  frequently 
encountered  in  the  theoretical  pattern  recognition  literature, 
and represent a particular instance of the discrete
Karhunen-Loeve expansion (Meisel, 1972; Young & Calvert, 1974).
Fortunately, the desired transformation coefficients are readily
obtained by decomposing the symmetric m by m covariance matrix
of stimulus measurements (in our case the 1/3-octave band
levels) using standard techniques. The normalized eigenvectors
resulting from this decomposition provide the transformation
coefficients, and the corresponding eigenvalues indicate the
relative importance of each eigenvector. An optimal
n-dimensional feature space may then be determined by selecting
the n eigenvectors having the largest eigenvalues.

More  specifically,  to  decompose  the  covariance  matrix  we 
need  to  solve  the  well-known  eigenvalue  problem 

$$
K \mathbf{e}_i = \alpha_i \mathbf{e}_i, \qquad i = 1, 2, \ldots, m,
$$

where $K$ represents the covariance matrix, $\{\mathbf{e}_i\}$ represents a set
of m orthogonal solution vectors, called eigenvectors, and
$\{\alpha_i\}$ are a set of m associated scalars called eigenvalues
(Green & Carroll, 1976). In the present context, the
eigenvectors  indicate  the  new  dimensions  in  the  feature  space 
and  the  ith  eigenvalue  reflects  the  variability  of  stimulus 
projections  onto  the  ith  feature  dimension.  Although  m 
eigenvectors  exist  for  an  m by  m covariance  matrix,  a more 
efficient  stimulus  representation  can  be  obtained  by  discarding 
the  eigenvectors  that  account  for  relatively  little  of  the 
stimulus  variability.  To  the  extent  that  redundancy  exists  in 
the  original  measurements,  the  information  in  the  stimulus  can 
be  adequately  portrayed  with  fewer  dimensions  in  the  feature 
space  than  in  the  measurement  space  (i.e.,  n < m in  the  notation 
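developed above). Once we have selected the n eigenvectors,
these values determine the coefficients in the transformation
matrix $T$, i.e., each row of $T$ is one of the selected eigenvectors:

$$
T = \begin{pmatrix} \mathbf{e}_1^{\top} \\ \mathbf{e}_2^{\top} \\ \vdots \\ \mathbf{e}_n^{\top} \end{pmatrix}.
$$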
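The construction of the transformation matrix from the covariance eigenvectors can be sketched as follows. This is a minimal illustration, assuming NumPy and hypothetical measurements (16 stimuli, 8 bands); the helper name `feature_transform` is an assumption introduced here, not a routine from the original analysis, which used the IMSL library.

```python
import numpy as np

def feature_transform(X, n):
    """Return (T, eigenvalues) for a stimuli-by-m array of measurement vectors.

    The rows of T are the n eigenvectors of the covariance matrix of X
    that have the largest eigenvalues.
    """
    K = np.cov(X, rowvar=False)              # symmetric m-by-m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(K)     # ascending order, orthonormal columns
    order = np.argsort(eigvals)[::-1]        # reorder so the largest comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, :n].T, eigvals

rng = np.random.default_rng(2)
# Hypothetical, redundant measurements: 16 stimuli, 8 bands, rank 3.
X = rng.normal(size=(16, 3)) @ rng.normal(size=(3, 8))
T, eigvals = feature_transform(X, n=2)

# Feature coordinates, one row per stimulus (centered for convenience).
F = (X - X.mean(axis=0)) @ T.T
print(F.var(axis=0, ddof=1), eigvals[:2])
```

The variance of the stimulus projections on each feature dimension equals the corresponding eigenvalue, which is why the eigenvalues can be read as the relative importance of the dimensions.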

Rationale

In  the  present  paper  we  will  examine  the  above  model  as  a 
characterization  of  the  feature  selection  process  for  human 
listeners  in  a timbre  comparison  task.  Since  it  is  well  known 
that  the  shape  of  the  amplitude  spectrum  is  the  primary  physical 
correlate  of  timbre  (Plomp,  1976),  any  model  that  describes  the 
feature  selection  process  must  account  for  its  effects  on  the 
psychological  feature  representation.  Because  we  are  primarily 
interested  in  the  transformations  involved  in  timbre  perception, 
sixteen  complex,  steady-state  sounds  that  differ  in  spectral 
shape  will  be  used  in  the  present  experiment.  The  sounds  were 
synthesized  by  combining  individual  sinusoidal  components  at 
1/3-octave  intervals.  These  intervals  were  selected  since  it  is 
generally  accepted  that  the  ear  resembles  a set  of  1/3-octave 
filters  in  its  frequency  resolving  power  (Plomp,  1976).  The 
amplitude  spectra  were  shaped  by  combining  the  components  at 
various amplitudes. All sixteen sounds had two spectral peaks
or formants of differing peak ratio and distinctiveness.

The  set  of  sixteen  complex  sounds  will  be  presented  to 
listeners  for  pairwise  similarity  judgments.  A measurement 
vector ($\mathbf{x}_i$) will be obtained for each sound by
amplitude-weighting the 1/3-octave band levels using Stevens'
loudness function (Stevens, 1972). These loudness-adjusted
spectra  will  be  analyzed  according  to  the  procedures  outlined 
above  to  determine  an  optimal  structure-preserving  feature 
transformation.  The  feature  representation  predicted  by  this 
theoretical  analysis  will  be  compared  to  the  subjective  feature 
representation  observed  in  the  experiment  to  test  the  adequacy 
of  the  model. 

The  feature  representation  actually  used  by  the  listeners 
will  be  estimated  by  submitting  the  observed  similarity  matrices 
to  a nonmetric  multidimensional  scaling  analysis.  In 
particular,  the  ALSCAL  program  (Takane,  Young  & de  Leeuw,  1977) 
will be used to decompose the data into an n-dimensional metric
space  in  which  each  stimulus  is  represented  as  a single  point  or 
vector.  The  dimensions  revealed  in  this  analysis  will  be  taken 
to  reflect  those  features  that  the  listeners  employed  to  compare 
the  sounds. 

Method 

Subjects 

Six  undergraduate  student  volunteers  (5  males  and  1 female) 
were  paid  an  average  of  $3.00  per  hour  for  their  participation. 
All  students  had  some  musical  background;  however,  none  had 
taken  formal  training  in  the  last  three  years.  The  volunteers 
reported no history of hearing disorders.

Apparatus

All  experimental  events  were  controlled  by  a Digital 
Equipment Corporation PDP-8/e computer. Statistical analyses
were carried out on the Catholic University's DECSystem-10
computer using the IMSL statistical library and the ALSCAL
multidimensional scaling program (Takane et al., 1977).

Listeners  were  isolated  in  a sound-attenuated  booth  during 
the  experiment.  A video  display  was  used  to  present  verbal 
feedback  and  instructions,  and  listeners  entered  their  responses 
on  a solid-state  keyboard.  A 12-bit  digital-to-analog  converter 
(Digital Equipment Corporation AA50) was used to output the
complex  auditory  waveforms  at  a sampling  rate  of  10  kHz. 
Synthesized  waveforms  were  low-pass  filtered  (Krohn-Hite  Model 
3550)  with  an  upper  cutoff  frequency  of  4 kHz  to  remove  aliasing 
frequencies.  The  sounds  were  passed  through  a programmable 
attenuator  (Texscan  PA-50)  before  being  presented  over  matched 
headphones (Telephonics TDH-49, MX41/AR cushions).

Stimuli

Sixteen  complex  steady-state  sounds  were  constructed 
digitally  by  adding  together  22  individual  sinusoidal 
components.  As  indicated  above,  these  components  were  spaced  at 
1/3-octave  intervals  between  20  and  2500  Hz.  Two  parameters, 
peak ratio and peak smear, were varied to produce amplitude
spectra of different shapes. The resulting spectra had maxima
at 500 and 1000 Hz with peak amplitude ratios of 1.00, .90, .80,
or .70 on a logarithmic scale. The amplitudes of the remaining
frequency components were determined by Gaussian distributions
centered at the two peak frequencies. Peak smear was
manipulated by varying the standard deviation of the
distributions (50, 100, 200, or 400 Hz). Thus, the spectra had
two distinct peaks with small standard deviations, but appeared
smeared  with  large  standard  deviations.  It  should  be  noted  that 
the  two  parameters  are  not  orthogonal  since  either  a low  peak 
ratio  or  a large  standard  deviation  would  produce  a more  uniform 
spectrum.  The  four  extreme  spectra  produced  by  the  combination 
of  these  two  dimensions  are  displayed  in  Figure  2,  and  the 
physical  parameters  for  each  stimulus  are  presented  in  Table  1. 
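The stimulus construction described above can be sketched in code. This is an illustration under several assumptions that the text does not settle: the peak ratio is applied as a linear amplitude multiplier (the source specifies a logarithmic scale), the two Gaussian envelopes are combined by taking their maximum, all components start in sine phase, and the nominal 1/3-octave center frequencies are used. The 10 kHz sampling rate and three-second duration follow the Apparatus and Procedure descriptions.

```python
import numpy as np

# Nominal 1/3-octave frequencies from 20 to 2500 Hz (22 components).
FREQS = np.array([20, 25, 31.5, 40, 50, 63, 80, 100, 125, 160, 200,
                  250, 315, 400, 500, 630, 800, 1000, 1250, 1600, 2000, 2500])

def component_amplitudes(peak_ratio, smear_hz):
    """Amplitude of each component from two Gaussian envelopes (assumed form)."""
    low = np.exp(-0.5 * ((FREQS - 500.0) / smear_hz) ** 2)
    high = peak_ratio * np.exp(-0.5 * ((FREQS - 1000.0) / smear_hz) ** 2)
    return np.maximum(low, high)   # assumption: envelopes combined by maximum

def synthesize(peak_ratio, smear_hz, duration_s=3.0, rate_hz=10_000):
    """Sum the 22 sinusoids and scale the waveform to a peak of 1."""
    t = np.arange(int(duration_s * rate_hz)) / rate_hz
    amps = component_amplitudes(peak_ratio, smear_hz)
    wave = (amps[:, None] * np.sin(2 * np.pi * FREQS[:, None] * t)).sum(axis=0)
    return wave / np.abs(wave).max()

# The 4 x 4 design: sound 1 is (1.00, 50), sound 16 is (0.70, 400).
PARAMS = [(r, s) for r in (1.00, 0.90, 0.80, 0.70) for s in (50, 100, 200, 400)]
sound_1 = synthesize(*PARAMS[0])
```

Under these assumptions a small standard deviation leaves two isolated peaks, while a large one fills in the components between 500 and 1000 Hz, which is the "smeared" appearance noted in the text.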


Insert  Figure  2 and  Table  1 here 


The  stimuli  were  equated  subjectively  for  loudness  by  a 
preliminary  group  of  listeners  who  did  not  participate  in  the 
experiment.  The  loudness-equated  sounds  were  presented  at 
levels  of  between  76  and  78  dB  SPL. 

Procedure 

Participants  were  seated  in  the  sound-attenuated  booth  and 
were  given  typewritten  instructions.  After  the  listeners 
understood the instructions, the set of sixteen complex sounds
was presented four times in order to familiarize the listener
with the sounds to be compared. The listener was
instructed  that  he  or  she  was  to  compare  the  stimuli,  and  assign 
a rating  of  "5"  if  the  two  sounds  were  very  similar  or  a rating 
of  "1"  if  the  sounds  were  very  dissimilar.  The  ratings  between 
1 and  5 were  to  be  used  for  pairs  of  intermediate  similarity. 
After  the  initial  familiarization  period,  the  listeners 
participated  in  the  comparison  task  for  three  days  in  one-hour 
sessions. At the end of the third day, a brief sound-sorting
task was given in which the listener had to order the sounds
from lowest to highest pitch by making pairwise judgments. The
participants were not informed of the pitch-sorting task until
after the third scaling session. The pitch-sorting task was
included to assess the possible role of pitch in the similarity
data.

Figure 2. Four extreme spectra (sounds 1, 4, 13, and 16) produced by
the combination of the peak ratio and peak smear parameters.

Table 1

Physical Parameter Values Used to Generate Each of the Sixteen Test Sounds.
Peak Ratio Refers to the Amplitude in dB of the 1000 Hz Peak Relative
to the 500 Hz Peak. Peak Smear is Expressed in Standard Deviation
Units (Hz).

Sound    Peak Ratio    Peak Smear
  1         1.00            50
  2         1.00           100
  3         1.00           200
  4         1.00           400
  5          .90            50
  6          .90           100
  7          .90           200
  8          .90           400
  9          .80            50
 10          .80           100
 11          .80           200
 12          .80           400
 13          .70            50
 14          .70           100
 15          .70           200
 16          .70           400

Each  trial  in  the  similarity  rating  task  began  when  the 
word  LISTEN  appeared  on  the  video  display.  After  a brief  delay, 
successive three-second samples of the comparison sounds were
presented with a one-second interstimulus interval. After the
second stimulus was presented, the words RATE SIMILARITY were
displayed. Listeners were allowed unlimited time to make their
response; however, most responded within four seconds. After
the listener responded, the display was cleared and the next
trial began. Each of the 120 possible stimulus pairs was
presented twice, counterbalanced for order of presentation.
This procedure was repeated on each of the three successive
days.
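The trial count follows directly from the design: sixteen sounds give 16 x 15 / 2 = 120 unordered pairs, and presenting each pair in both orders gives 240 trials per session. A minimal sketch of such a trial list is shown below; the function name `build_session` and the random shuffling of trial order are assumptions made for illustration.

```python
import itertools
import random

def build_session(n_sounds=16, seed=None):
    """Return all ordered pairs of distinct sounds in a shuffled sequence."""
    # 120 unordered pairs for sixteen sounds.
    pairs = list(itertools.combinations(range(1, n_sounds + 1), 2))
    # Counterbalance order: each pair appears once as (a, b) and once as (b, a).
    trials = pairs + [(b, a) for a, b in pairs]
    random.Random(seed).shuffle(trials)   # assumption: randomized trial order
    return trials

trials = build_session(seed=0)
print(len(trials))   # number of trials in one session
```

Placing the rating for trial (a, b) in row a, column b of a 16 by 16 array yields the full matrix analyzed later, with the two presentation orders falling in the upper and lower halves.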


At  the  end  of  the  third  day,  listeners  participated  in  the 
pitch-sorting  task.  Before  beginning,  each  of  the  sixteen 
sounds  was  played  to  review  the  entire  set.  On  each  rating 
trial,  the  participant  saw  the  word  LISTEN  followed  by  a 
stimulus  pair,  and  then  the  words  WHICH  SOUND  WAS  LOWEST  IN 
PITCH (I.E., MORE BASS SOUNDING)? were displayed. The listener
then pressed "1" if the first sound was lower than the second,
or "2" if the second was lower than the first. The listener
could repeat the trial by pressing a key marked S.

Feature Selection

Page 16

A bubble-sort algorithm was employed to sort the sounds using the
pairwise  pitch  ratings.  After  the  sort  was  complete,  the 
listener  heard  all  of  the  sounds  in  the  pitch  ordering  that  he 
had  determined.  If  the  listener  was  not  satisfied  with  this 
ordering,  the  above  task  could  be  repeated.  However,  all 
listeners  required  only  one  pass  to  achieve  a satisfactory 
sorting.  Sound  pairs  were  presented  in  a different  random  order 
for  each  listener. 
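The sorting procedure described above can be sketched as a bubble sort whose comparison step is a listener's pairwise judgment. In this illustration the listener is simulated by a hypothetical hidden pitch value for each sound; the names `bubble_sort_by_judgment`, `hidden_pitch`, and `simulated_listener` are assumptions and do not come from the original program.

```python
def bubble_sort_by_judgment(items, first_is_lower):
    """Order items from lowest to highest using only pairwise judgments.

    first_is_lower(a, b) should return True when a is judged lower than b.
    """
    order = list(items)
    for end in range(len(order) - 1, 0, -1):
        swapped = False
        for i in range(end):
            # Present the adjacent pair; swap if the second is judged lower.
            if not first_is_lower(order[i], order[i + 1]):
                order[i], order[i + 1] = order[i + 1], order[i]
                swapped = True
        if not swapped:   # a full pass with no swaps means the list is sorted
            break
    return order

# Simulated listener with a hypothetical, distinct pitch value per sound.
hidden_pitch = {sound: (sound * 7) % 17 for sound in range(1, 17)}

def simulated_listener(first, second):
    return hidden_pitch[first] < hidden_pitch[second]

ordering = bubble_sort_by_judgment(range(1, 17), simulated_listener)
print(ordering)
```

A bubble sort only ever compares adjacent items in the current ordering, so each comparison maps naturally onto one two-sound trial.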


Results  and  Discussion 

Theoretical Analysis

A predicted feature representation was obtained by applying
the model to the 22-element measurement vectors ($\mathbf{x}_i$) for the
sixteen sounds. As indicated in the introduction, the predicted
features are simply principal component axes obtained in an
eigen-analysis of the measurement covariance matrix. The
variance accounted for by each axis or feature is given by the
corresponding eigenvalue. In the present case, the first two
principal components accounted for 91% of the overall stimulus
variability (74% and 17% for the first and second principal
components,  respectively).  Since  the  third  principal  component 
accounted  for  less  than  6%  of  the  overall  variance,  it  was  not 
considered  further.  This  analysis  indicates  that  listeners  need 
only  use  two  comparison  features  to  account  for  most  of  the 
variability  in  the  present  stimuli. 
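The proportion of variance attributed to each component is simply its eigenvalue divided by the sum of all eigenvalues. The sketch below assumes NumPy; the eigenvalues in `example` are hypothetical numbers chosen only to mimic the reported pattern (74% and 17% for the first two components) and are not the eigenvalues obtained in the study. The 90% threshold is likewise an illustrative assumption.

```python
import numpy as np

def variance_accounted_for(eigvals):
    """Fraction of total variance for each eigenvalue, largest first."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    return eigvals / eigvals.sum()

def n_features_needed(eigvals, threshold=0.90):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    cumulative = np.cumsum(variance_accounted_for(eigvals))
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical eigenvalues (not the study's values).
example = [7.4, 1.7, 0.5, 0.4]
fractions = variance_accounted_for(example)
print(fractions)
print(n_features_needed(example))
```

With these illustrative values the first two components together reach the threshold, which parallels the argument that two comparison features suffice for the present stimuli.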

The  normalized  transformation  coefficients  obtained  in  this 
analysis are displayed in Table 2. As indicated in the
introduction, the feature projections for any stimulus are
obtained from the matrix equation, $\mathbf{f}_i = T\mathbf{x}_i$, where the
transformation matrix $T$ is given by the coefficients in Table 2.
The left half of Figure 3 displays a plot of the sixteen stimuli in the
predicted two-dimensional feature space.


Insert  Figure  3 and  Table  2 here 


It  is  obvious  from  an  examination  of  Figure  3 that  the 
first  principal  component  (Dimension  1)  is  related  to  the  peak 
smear  parameter  used  to  generate  the  stimuli.  Stimuli  having 
the least smear (1, 5, 9, 13) appear at the far right along this
dimension, whereas stimuli having the greatest smear (4, 8, 12,
16) appear on the extreme left. It is also evident, however,
that stimuli having the same peak smear do not have identical
Dimension 1 coordinates. For example, sounds 3, 7, 11, and 15
were  all  synthesized  with  a standard  deviation  of  200  Hz,  but 
have  differing  coordinates  along  this  dimension.  It  appears, 
then,  that  although  Dimension  1 is  determined  primarily  by  peak 
smear,  it  also  depends  on  the  peak  ratio  parameter. 

A more  complete  description  of  this  predicted  feature  may 
be obtained by examining the $\mathbf{t}_1$ coefficient vector in Table 2.
Since  the  Dimension  1 projection  for  any  stimulus  is  simply  a 
weighted  linear  combination  of  its  22  band-level  measurements, 
the  coefficients  indicate  the  relative  importance  of  each 
individual  band  level.  For  Dimension  1,  the  two  frequency  bands 
lying  between  the  500  and  1000  Hz  peaks,  630  and  800  Hz,  have 
the largest coefficients. This is generally consistent with our
observation that Dimension 1 is most closely related to the peak
smear parameter. However, the amplitude of components lying
between the two peaks will be determined by both peak ratio and
peak smear. These values may be increased by either (1)
increasing the smear, as in moving from sound 1 to sound 4, or
(2) increasing the peak ratio, as in moving from sound 15 to
sound 3. In summary, the model predicts that listeners will
focus on the intensity of components between the two peaks as a
primary comparison feature. It is interesting to note in this
context that a similar "peak distinctiveness" perceptual feature
was described by Howard (1977) in an earlier psychophysical
investigation of complex sounds.

Table 2

Normalized Transformation Coefficients (x 105) for Each Frequency Component
Obtained in a Theoretical Analysis of the Sixteen Sounds. The Two
Coefficient Vectors, t1 and t2, Form the Predicted Transform Matrix T.^a

Component    Frequency (Hz)    t1        t2
    1             20           -38       -1
    2             25           -39       -1
    3             31.5         -40       -1
    4             40           -41        0
    5             50           -42        0
    6             63           -44        1
    7             80           -46        2
    8            100           -55        5
    9            125           -75       10
   10            160           -98       21
   11            200          -129       39
   12            250          -144       63
   13            315          -174      114
   14            400          -206      [illegible]
   15            500            --      [illegible]
   16            630          -236      [illegible]
   17            800          -21?      [illegible]
   18           1000           -66      [illegible]
   19           1250          -136      [illegible]
   20           1600           -44      [illegible]
   21           2000            -8      [illegible]
   22           2500             0        0

^a Component amplitudes were equated at the 500 Hz peak; hence, zero
coefficients were observed.

When  the  second  dimension  is  considered,  a similar  picture 
emerges.  Examination  of  Figure  3 clearly  indicates  that  the 
stimulus  coordinates  along  this  dimension  are  determined  by  an 
interaction  of  the  peak  ratio  and  peak  smear  parameters. 
Although  the  peak  ratio  rank  ordering  is  maintained  within 
groups  of  four  stimuli,  the  absolute  Dimension  2 coordinate 
depends  on  peak  smear  as  well.  As  with  Dimension  1,  a clearer 
understanding  of  this  feature  may  be  obtained  by  examining  the 
$\mathbf{t}_2$ coefficient vector in Table 2. It is interesting that
frequencies below 400 Hz contribute very little to this feature
value. In contrast, the bands adjacent to the 500 Hz peak, 400
and 630 Hz, have large positive coefficients, and the 1000 and
1250 Hz bands have very large negative weights. When these
coefficients are applied to transform the 1/3-octave spectra for
our sounds, a "relative pitch" dimension emerges. In
particular, sounds having relatively greater high frequency
energy  (i.e.,  1000  and  1250  Hz  region)  will  produce  large 
negative  coordinates  for  this  feature.  For  example,  since  sound 
5 has  a lower  peak  ratio  than  sound  1,  it  is  clear  that  it  will 
have  relatively  less  energy  in  the  high  frequency  region.  When 
sound  2 is  compared  to  sound  1,  however,  we  must  consider  the 
role  of  peak  smear.  In  this  case,  the  100  Hz  standard  deviation 
used  to  smear  the  peaks  in  sound  2 effectively  increases  the  low 
frequency  energy  relative  to  the  high  frequency  energy.  This 
occurs  because  of  the  wider  1/3-octave  intervals  in  the  high 
frequency  region.  In  contrast,  when  sound  4 is  compared  to 
sound  2,  the  broader  peak  smearing  used  for  sound  4 (400  Hz 
standard  deviation)  also  increases  the  amplitude  of  the  more 
heavily weighted 1250 Hz component. The net result is that the
Dimension  2 coordinate  for  sound  4 is  somewhat  more  negative 
than  that  for  sound  2.  Subjectively,  we  can  say  that  listeners 
are  expected  to  compare  the  overall  pitch  of  the  stimuli  within 
the  four-stimulus  clusters  along  Dimension  1.  The  expected 
interaction  of  peak  smear  and  peak  ratio  is  clearly  evident  in 
the  inverted  "U"  distribution  of  stimuli  in  the  predicted 
feature  space. 

To  summarize,  the  feature  selection  model  predicts  that 
listeners  will  need  only  two  comparison  features  to  adequately 
perform the similarity judgment task for these stimuli. More
specifically,  we  expect  the  intensity  of  inter-peak  components 
or  peak  distinctiveness  to  be  particularly  important  (Dimension 
1). The second, but less important, feature should reflect the
relative amount of high versus low frequency energy in the
sounds. 

Perceptual Analysis

A full  16  by  16  matrix  of  similarity  judgments  was  obtained 
from  each  listener  on  each  of  the  three  sessions.  The  first 
session  was  viewed  as  practice,  and  these  data  were  not 
considered  further.  Data  were  summed  across  the  remaining  two 
sessions  to  yield  a single  proximity  matrix  for  each  listener. 
These  data  were  checked  for  consistency  by  computing  a Pearson 
product-moment  correlation  between  the  upper  and  lower  halves  of 
each matrix. The average correlation was .70, with five of the
six listeners showing a correlation of .64 or better. This was
taken to indicate that the listeners' similarity judgments were
sufficiently  stable  to  justify  further  analysis. 
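
The consistency check described above can be sketched as follows. This is a minimal illustration, assuming that "upper and lower halves" means each rating above the diagonal paired with its mirror-image rating below it. The function names and the small toy matrix are hypothetical; they are not the authors' program or data.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def half_matrix_consistency(m):
    """Correlate each above-diagonal rating m[i][j] with its mirror m[j][i]."""
    n = len(m)
    upper = [m[i][j] for i in range(n) for j in range(i + 1, n)]
    lower = [m[j][i] for i in range(n) for j in range(i + 1, n)]
    return pearson_r(upper, lower)

# Hypothetical 4 x 4 ratings (illustrative only, not the study's data).
toy = [
    [0, 2, 5, 7],
    [3, 0, 4, 6],
    [5, 4, 0, 2],
    [8, 6, 1, 0],
]
consistency = half_matrix_consistency(toy)
```

A listener who rated every pair identically in both presentation orders would give a value of 1.0; lower values indicate less stable judgments.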

The  summed  matrix  for  each  individual  was  submitted  to  a 
nonmetric  individual  differences  ALSCAL  analysis.  The  selected 
nonmetric  scaling  model  required  that  we  only  assume  ordinal 
level  measurement  in  the  initial  subjective  proximity  matrices. 
In  addition,  the  individual  differences  model  provides  a 
saliency  vector  for  each  listener  that  indicates  the  relative 
importance  of  each  dimension  for  that  person.  The  latter 
property  will  enable  us  to  assess  individual  listener 
consistency.

The  two-dimensional  ALSCAL  solution  provided  an  adequate 
representation  of  the  subjective  similarity  data.  Although  the 
observed  stress  (18.6%)  was  only  in  the  "fair"  range  according 
to  Kruskal  (1964),  the  addition  of  a third  dimension  resulted  in 
little improvement in either stress (5%) or interpretability. In
addition, Young (1970) has pointed out that for proximity data
containing  any  sampling  error,  stress  tends  to  increase  with  the 
number  of  data  points.  This  occurs  despite  the  fact  that  the 
scaling  solution  may  actually  recover  most  of  the  underlying 
metric  information.  The  stimulus  space  obtained  in  our  scaling 
analysis  is  displayed  in  the  right  half  of  Figure  3.  These  data 
will  be  discussed  in  terms  of  the  theoretical  predictions 
outlined  above. 
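
For reference, Kruskal (1964) defines stress as a normalized residual between the distances in the scaling configuration and the best monotone transform of the data (the disparities). The sketch below computes that index for two lists that are assumed to be already available. Whether the 18.6% figure above was computed with exactly this formula is an assumption, since the scaling program may report a related criterion; the function name and numbers are illustrative only.

```python
import math

def kruskal_stress_1(distances, disparities):
    """Kruskal's stress formula 1: sqrt(sum((d - dhat)^2) / sum(d^2))."""
    num = sum((d - dh) ** 2 for d, dh in zip(distances, disparities))
    den = sum(d ** 2 for d in distances)
    return math.sqrt(num / den)

# Hypothetical configuration distances and monotone-fitted disparities.
d = [1.0, 2.0, 3.0, 4.0]
dhat = [1.2, 1.8, 3.1, 3.9]
stress = kruskal_stress_1(d, dhat)
```

A stress of zero means the configuration distances reproduce the rank order of the proximities exactly.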

The  finding  that  a two-dimensional  scaling  solution  was 
adequate  suggests  that  our  listeners  employed  two  comparison 
features.  This  was  predicted  in  our  theoretical  analysis.  It 
was  further  predicted  that  the  two  dimensions  would  differ 
widely in their relative importance. Theoretically, Dimension 1
accounted for 74% of the stimulus variability and Dimension 2
accounted for only 17% of the variability. A similar result was
observed  in  our  scaling  analysis  of  the  perceptual  data.  All 
six  listeners  placed  relatively  greater  emphasis  on  one 
dimension  (Dimension  1 in  Figure  3)  than  on  the  other. 

An  initial  visual  comparison  of  the  theoretical  and 
observed  feature  spaces  in  Figure  3 reveals  both  similarities 
and  differences.  First,  it  is  apparent  that  the  overall 
configuration  of  stimuli  is  similar  in  the  two  spaces.  In  both 
cases,  the  stimuli  have  an  inverted  "U"  distribution,  albeit 
more  pronounced  in  the  subjective  space,  and  the  stimulus 
projections  onto  the  two  axes  are  generally  comparable.  The 
Pearson  product-moment  correlations  between  the  corresponding 
coordinates in the two spaces are consistent with this
observation (r = .94, t(14) = 9.91, p < .01, and r = .82,
t(14) = 5.46, p < .01 for Dimensions 1 and 2, respectively).
The  actual  theoretical  and  observed  coordinates  for  both 
dimensions  are  presented  in  Table  3. 
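
The t values reported with these correlations follow from the usual conversion t = r * sqrt(n - 2) / sqrt(1 - r^2), with n = 16 stimuli and therefore 14 degrees of freedom. A minimal sketch follows; the function name is illustrative. Because r is entered here rounded to two decimals, the results only approximate the reported 9.91 and 5.46.

```python
import math

def t_from_r(r, n):
    """t statistic for testing a Pearson r against zero, with n - 2 df."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    return t, df

# Correlations between predicted and observed coordinates (16 sounds).
t_dim1, df = t_from_r(0.94, 16)
t_dim2, _ = t_from_r(0.82, 16)
```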


Insert  Table  3 here 


We  may  conclude,  therefore,  that  the  feature  selection  model 
outlined  above  successfully  predicted  the  signal  attributes  that 
listeners  would  use  to  compare  the  sounds. 

However,  despite  this  overall  consistency,  a number  of 
important  differences  exist  that  deserve  further  comment.  With 
regard  to  Dimension  1,  it  is  obvious  in  Figure  3 that  listeners 
did  not  clearly  distinguish  the  four  stimulus  clusters  predicted 
by  the  model.  Rather,  the  listeners  tended  to  dichotomize 
stimuli along this dimension, maintaining a large perceptual
difference  between  the  low  peak  smear  stimuli  (50  and  100  Hz 
standard  deviation)  and  the  high  peak  smear  stimuli  (200  and  400 
Hz  standard  deviation).  It  is  interesting  to  note,  however, 
that the predicted between-cluster rank orderings are observed
in all cases. Mean Dimension 1 projections of 1.28, .64, -.82,
and -1.09 were observed for the four expected clusters
(1-5-9-13, 2-6-10-14, 3-7-11-15, and 4-8-12-16, respectively),
and in no instance did the clusters overlap along this
dimension. It appears, then, that our listeners made somewhat
cruder stimulus distinctions in the peak distinctiveness feature
than were predicted by the model.

Table 3

Predicted and Obtained Psychological Coordinates for Each of
the Sixteen Complex Sounds

          Predicted Dimensions     Observed Dimensions
Sound         1         2              1         2

   1        1.11     -1.87           1.24     -1.46
   2         .45     -1.42           1.02      -.78
   3        -.74     -1.37           -.63     -1.01
   4       -1.70     -1.59           -.99     -1.36
   5        1.20      -.45           1.28      -.91
   6         .60       .03            .70       .83
   7        -.39       .32           -.87       .29
   8       -1.34       .30          -1.13      -.99
   9        1.25       .20           1.34       .10
  10         .66       .70            .42      1.40
  11        -.30      1.09           -.92       .88
  12       -1.25       .53          -1.09      -.22
  13        1.27       .51           1.25       .53
  14         .70      1.02            .41      1.70
  15        -.26      1.44           -.85      1.32
  16       -1.24       .55          -1.18      -.32
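
The cluster means quoted above can be recomputed from the observed Dimension 1 column of Table 3. The sketch below does so and also checks that adjacent clusters do not overlap. It is illustrative only: the entry for sound 4 is garbled in the scanned table and is assumed here to be -.99, which gives a fourth mean of about -1.10 rather than exactly -1.09.

```python
# Observed Dimension 1 coordinates by sound number, read from Table 3.
# ASSUMPTION: the value for sound 4 is garbled in the source; -0.99 is used.
observed_dim1 = {
    1: 1.24, 2: 1.02, 3: -0.63, 4: -0.99,
    5: 1.28, 6: 0.70, 7: -0.87, 8: -1.13,
    9: 1.34, 10: 0.42, 11: -0.92, 12: -1.09,
    13: 1.25, 14: 0.41, 15: -0.85, 16: -1.18,
}

# The four expected clusters, ordered from least to most peak smear.
clusters = [(1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15), (4, 8, 12, 16)]

def cluster_values(sounds):
    """Return the Dimension 1 coordinates for the given sound numbers."""
    return [observed_dim1[s] for s in sounds]

means = [sum(cluster_values(c)) / len(c) for c in clusters]

# Clusters do not overlap if each cluster's smallest coordinate exceeds
# the largest coordinate of the next cluster along the dimension.
no_overlap = all(
    min(cluster_values(a)) > max(cluster_values(b))
    for a, b in zip(clusters, clusters[1:])
)
```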

Another  discrepancy  of  interest  was  observed  in  Dimension 
2.  Here,  the  interaction  of  the  peak  ratio  and  peak  smear 
parameters  was  stronger  than  expected  theoretically.  In 
particular,  the  low  frequency  dominant  sounds  (10,  14,  11,  and 
15)  were  more  clearly  distinguished  from  the  high  frequency 
dominant  sounds  than  expected.  This  difference  would  occur  if 
the  listeners  gave  the  lower  spectral  region  (i.e.,  adjacent  to 
the 500 Hz peak) greater weight than indicated by the predicted
transformation  coefficients  in  Table  2.  This  result  would  be 
expected  if  perceptual  masking  effects  are  considered. 

To  obtain  additional  information  on  the  subjective 
properties  of  this  feature,  the  results  of  the  pitch  ranking 
task  were  examined.  In  this  task  listeners  performed  a pairwise 
pitch  sorting  of  the  sounds.  If  Dimension  2 in  the  scaling 
solution reflects overall stimulus pitch, then the pitch ranking
obtained in the sorting task should correspond to the Dimension
2 coordinate ranking. Since the six listeners produced
generally consistent rank orderings (coefficient of concordance
W = .87, χ²(15) = 77.94, p < .001), a rank of summed ranks was
determined  for  each  stimulus.  Of  interest  here  was  the  finding 
that  all  eight  low  peak  smear  stimuli  (i.e.,  generated  with  50 
or  100  Hz  standard  deviations)  were  ranked  lower  in  pitch  than 
the  eight  high  smear  stimuli.  This  is  clearly  inconsistent  with 
the  observed  Dimension  2 projections.  It  is  important  to  note, 
however,  that  only  the  high  smear  sounds  had  any  significant 
energy  at  frequencies  beyond  1000  Hz  because  of  the  wider 
component spacing in this region. Our listeners were sensitive
to this high frequency energy, and assigned these sounds
appropriately higher pitch rankings. When the low and high
smear sounds are considered separately, the within-set orderings
correspond reasonably well to the Dimension 2 projection ranks
in the perceptual feature space, as may be seen in Table 4.
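
The coefficient of concordance reported above and its chi-square test are linked by the standard relation chi-square = m(n - 1)W, for m judges ranking n items, with n - 1 degrees of freedom. The sketch below computes W without a tie correction and applies that relation. The function names and toy rankings are hypothetical; with W entered as the rounded .87 for six listeners and sixteen sounds, the relation gives about 78.3, close to the reported 77.94.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance W (no tie correction).

    rankings: one list per judge, each giving ranks for the same n items.
    """
    m = len(rankings)
    n = len(rankings[0])
    totals = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

def chi_square_from_w(w, m, n):
    """Chi-square approximation for testing W, with n - 1 df."""
    return m * (n - 1) * w, n - 1

# Hypothetical rankings of four items by three judges (not the study's data).
toy = [[1, 2, 3, 4], [1, 3, 2, 4], [2, 1, 3, 4]]
w_toy = kendalls_w(toy)

# Relation applied to the rounded W reported in the text.
chi2_reported, df = chi_square_from_w(0.87, m=6, n=16)
```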


Insert  Table  4 here 

This observation was confirmed by significant Spearman
correlations for both sound clusters (r = .98, p < .01 and
r = .71, p < .05 for the low and high smear sounds,
respectively). This finding is consistent with our theoretical
analysis in indicating that Dimension 2 reflects pitch in a
relative rather than absolute sense.
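
The two rank correlations can be checked against the ranks in Table 4 with the simple Spearman formula, 1 - 6 * sum(d^2) / (n(n^2 - 1)), which ignores the correction for ties. This is a sketch under stated assumptions: the scanned table lists only seven of the eight sounds in each set, so the pitch ranks for sounds 14 and 16 are inferred here on the assumption that each set uses the ranks 1 through 8 exactly once (with a tie at 7.5 in the high smear set). Their scaling ranks follow from the observed Dimension 2 coordinates in Table 3. Under those assumptions the results round to the reported .98 and .71.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rho from two rank lists, without a tie correction."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Low smear sounds in the order 1, 2, 5, 6, 9, 10, 13, 14.
# ASSUMPTION: the last entry (sound 14) is inferred, not read from Table 4.
low_pitch = [1, 4, 2, 6, 3, 7, 5, 8]
low_scaling = [1, 3, 2, 6, 4, 7, 5, 8]

# High smear sounds in the order 3, 4, 7, 8, 11, 12, 15, 16.
# ASSUMPTION: the last entry (sound 16) is inferred, not read from Table 4.
high_pitch = [2, 1, 3.5, 3.5, 5, 6, 7.5, 7.5]
high_scaling = [2, 1, 6, 3, 7, 5, 8, 4]

rho_low = spearman_rho(low_pitch, low_scaling)
rho_high = spearman_rho(high_pitch, high_scaling)
```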

Summary and Conclusions

The  perceptual  data  considered  above  were  generally 
consistent  with  the  predictions  of  our  feature  selection  model. 
The  model  successfully  predicted  the  two  comparison  features 
that  the  listeners  used  to  generate  their  pairwise  similarity 
ratings.  In  addition,  it  was  able  to  predict  the  relative 
importance  of  these  two  dimensions.  This  suggests  that  our 
theoretical  assumptions  about  the  listener's  feature  selection 
criteria  were  reasonable.  The  model  proposes  that  listeners 
perform a structural analysis of the variability in the stimulus
set,  and  select  features  that  enable  them  to  retain  as  much  of 
this  variability  as  possible  while  eliminating  redundancy. 

Table 4

Rank Order of Low Smear and High Smear Sounds Observed in the Pitch Ranking
Task and Multidimensional Scaling Solution (Dimension 2).

        Low Smear Sounds                   High Smear Sounds
Sound   Pitch Rank   Scaling       Sound   Pitch Rank   Scaling

   1        1           1             3        2           2
   2        4           3             4        1           1
   5        2           2             7        3.5         6
   6        6           6             8        3.5         3
   9        3           4            11        5           7
  10        7           7            12        6           5
  13        5           5            15        7.5         8

Simply put, the listeners were doing exactly what we expected
them to do, that is, compare the sounds, in a statistically
efficient manner.

This  interpretation  is  consistent  with  the  process-oriented 
approach to auditory feature selection (Howard & Ballas, 1978).
It asserts that the most reasonable questions to ask about
timbre perception should address the feature selection process
that  listeners  use  rather  than  the  feature  detectors  that  they 
use.  Indeed,  it  is  entirely  possible  that  specific  timbre 
features  do  not  exist  in  any  absolute  sense.  The  invariant  and 
predictable  aspect  of  timbre  perception  may  well  involve  a set 
of  rules  and  criteria  that  specify  a flexible  feature  selection 
process.  It  has  been  our  objective  here  to  investigate  this 
possibility.

Although  the  present  model  enjoyed  some  success  in 
predicting  the  general  characteristics  of  the  perceptual  feature 
space,  a number  of  difficulties  exist.  In  particular,  we  noted 
that  the  fine  structure  or  distribution  of  stimuli  within 
dimensions was not well handled by the model. This shortcoming
will hopefully be eliminated as we are able to develop a more
precise specification of the proposed transformations. At
present, for example, we summarize the contribution of the
auditory periphery by a loudness-weighted 1/3-octave spectral
analysis.  Although  reasonable  as  a first  approximation,  masking 
and  other  known  peripheral  effects  must  be  considered. 
Similarly,  we  must  clarify  the  role  of  attentional  bias  in  the 
feature  selection  process,  specify  an  appropriate  measurement 
space for transient or time-varying signals, and indicate how
the proposed structural analysis takes place on a trial-by-trial
basis. These and other questions clearly call for additional
research.

Finally,  it  is  important  to  recognize  that  the  present 
research  has  a number  of  important  practical  implications  beyond 
the  theoretical  issues  discussed  above.  Once  specified,  a 
feature selection model will enable us to predict a priori the
features  or  sources  of  variation  that  listeners  will  use  to 
evaluate  complex  aural  signals.  Once  the  feature  structure  is 
known,  the  confusability  of  specific  stimuli  can  be  anticipated. 
In  a context  where  this  information  is  important,  e.g.,  in  the 
classification  of  aural  sonar  signatures,  preprocessors  or  other 
performance aids may be introduced to reduce item confusability
as  required. 